Regex || regular expressions (postem here)

http://rlemon.com/TESTING/test_url_search.php <- temp url to the script.

The script itself is a simple preg_match_all; however, I have come up with a nice piece of regex that will return all URLs on the page. Great for parsing, say, forum posts.

here is the expression:


(\b[a-zA-Z0-9]+://[^( |\>)]+\b)


it can be used like:




$subject = file_get_contents("./path/to/file.html"); // any string
$search = '(\b[a-zA-Z0-9]+://[^( |\>)]+\b)';

preg_match_all($search, $subject, $matches);

print_r($matches);

Regex scares (with few exceptions) the best and worst of us. Regex resources on the web do not always translate directly to PHP, though most do.

Anyway, if you have a good/useful PHP regular expression, please post it here and do not start a separate thread unless your snippet is more involved.

Please don't simply post the regex/pattern.
Give at least one example of it in action (if you don't, the post gets removed).

A good start from rlemon follows... errr, OK, I stuffed up the thread merge ;) so the snippet precedes this post!

Even your rlemon.com example shows some errors:


Array
(
[0] => Array
(
[0] => http://www.rlemon.com
[1] => http://rlemon.org<br
[2] => ftp://[email protected]
[3] => ftp://[email protected]</a
)

)
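A guess at why the `<br` and `</a` tails show up: the negated class in the original pattern stops at a space or `>`, but not at `<`, so the match runs on into the next tag. Here is a minimal sketch of that, with a hypothetical sample string and an assumed tweak (adding `<`, the double quote and whitespace to the class). It is not a complete URL matcher.

```php
<?php
// Hypothetical sample input; the markup mimics a forum post.
$subject = 'Visit http://rlemon.org<br /> or <a href="http://www.rlemon.com">home</a>';

// Original pattern from the first post (the outer parentheses act as delimiters).
$original = '(\b[a-zA-Z0-9]+://[^( |\>)]+\b)';

// Assumed tweak: also stop at "<", at double quotes and at any whitespace.
$tweaked = '~\b[a-zA-Z0-9]+://[^\s()|<>"]+\b~';

preg_match_all($original, $subject, $before);
preg_match_all($tweaked, $subject, $after);

print_r($before[0]); // first entry keeps the "<br" tail
print_r($after[0]);  // both entries are bare URLs
```

The trailing `\b` is what trims the `"` after the second URL in both versions; only the `<` case needs the extra character in the class.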

Here's a URL match that passes the rlemon test. Still not perfect, but workable. :)


preg_match_all('/([a-zA-Z]{2,8}:\/\/[a-zA-Z0-9\.\[email protected]]{2,}\.[a-zA-Z0-9]{1,6}.*?)([^a-zA-Z0-9%@\.\/\?&=]|$)/', $text, $matches);


The important results will be in $matches[1]

My expression passed in RegexCoach (a third-party app), and on screen (not viewing source in FF) it appeared to work, hence why I posted. I'm reviewing the expression tomorrow (I don't code on weekends anymore :P) and hopefully I can fix it. :cool:

I actually found this one in O'Rly's "Perl Best Practices" (not as an example of Perl best practices, to be fair), attributed to someone who goes by the name Abigail, who I guess is famous in certain of the nerdier circles. After nearly breaking my brain trying to figure out how it works, I decided to share this method for determining if a number is prime:



function isPrime($n)
{
    // Write $n in unary ('111...'); the pattern fails to match only when $n is prime.
    return is_int($n) && !preg_match('/^ (?: 1? | (11+?) \1+) $/xms', str_repeat('1', $n));
}


Brilliant and horrible, like so much of the canonical Perl idiom.

edit: An example application, rather useless but demonstrative:


# Prints all primes in the loop range
for ($i = 0; $i < 1000; ++$i)
{
    if (isPrime($i))
    {
        echo "$i\n";
    }
}
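For anyone else trying to follow the trick, here is a rough sketch of how I read it. The number is written in unary, `(11+?)` tries a block of 2, 3, 4... ones, and `\1+` has to repeat that same block until the string is used up. If some block tiles the string exactly, the number is composite and the block length is a divisor. The helper name `smallestFactor` is made up for illustration.

```php
<?php
// Hypothetical helper: returns the block length the pattern settles on, or null.
function smallestFactor(int $n): ?int
{
    if ($n < 2) {
        return null; // 0 and 1 are covered by the "1?" branch of the original pattern
    }
    // The lazy (11+?) tries the shortest block first, so the first hit is the smallest divisor.
    if (preg_match('/^(11+?)\1+$/', str_repeat('1', $n), $m)) {
        return strlen($m[1]);
    }
    return null; // no block tiles the string, so $n is prime
}

var_dump(smallestFactor(15)); // int(3)
var_dump(smallestFactor(7));  // NULL
```

No conditionals are involved, only a backreference and a lot of backtracking, which is also why this gets slow for large numbers.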

I found this tutorial very useful in writing regex - http://www.regular-expressions.info/

Lol, regex started out as my worst enemy, and still remains challenging, yet I definitely enjoy it more now! :D When I first saw the quote in my sig, I found it very appropriate! ;)

Here's my URL parsing regex from developing my bbCode interpreter:

$re = "#[a-z]+?://[^<>\"\s]*[^\s.!?<>#@()\"]#i";

I try my best not to match any characters that might directly follow the URL.

Concerning the isPrime regex, that thing has practically made my nose bleed! It looks like they're using conditional regex, but I still don't follow the reasoning. :-| :(

Your regex has issues, Curtis. The most glaring is that the dot in the final character class has special meaning. It's a wildcard, matching either any character or any character besides newlines depending on the trailing flags, which really makes a lot of difference in the end meaning. You can use the preg_quote() function on character classes you would like to be treated literally to ensure special regex rules are bypassed. E.g.:



echo preg_quote(".!?<>#@()\"");
# Output is: \.\!\?\<\>#@\(\)" (PHP 7.3+ also escapes the #, giving \#)

The dot meta-character, in fact, has no special meaning in character classes, according to the following test.

<?php
echo preg_match("/[.]/", 'a') ? 'Match' : 'Not found'; // OUTPUTS: Not found
echo preg_match("/[.]/", '.') ? 'Match' : 'Not found'; // OUTPUTS: Match
?>
Also, the following is from the PHP manual on Pattern Syntax (http://php.net/manual/en/reference.pcre.pattern.syntax.php)

Part of a pattern that is in square brackets is called a "character class". In a character class the only meta-characters are:

\    general escape character
^    negate the class, but only if the first character
-    indicates character range
]    terminates the character class

The dot meta-character is not listed.

However, I did spot an error in my code. I didn't escape the regex delimiter, in this case, #. I overlooked that it needed to be escaped. I just tested this, and several variations.

$re = "#[a-z]+?://[^<>\"\s]*[^\s.!?<>\#@()\"]#i";
$string = 'Here\'s a link: <http://www.google.com>. Some more words.';
echo preg_replace($re, '<a href="$0" title="Link!">$0</a>', $string);
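A possible way to avoid escaping the delimiter by hand: preg_quote() takes an optional second argument naming the delimiter, and escapes it along with the usual special characters. This is a sketch using the same character set and test string as above; the variable names `$trailing` and `$class` are only for illustration.

```php
<?php
// Characters that should not end a URL (same set as in the pattern above).
$trailing = '.!?<>#@()"';

// The second argument asks preg_quote() to escape the delimiter as well.
$class = preg_quote($trailing, '#');

// Splice the escaped characters into the final negated class.
$re = '#[a-z]+?://[^<>"\s]*[^\s' . $class . ']#i';

$string = 'Here\'s a link: <http://www.google.com>. Some more words.';
$result = preg_replace($re, '<a href="$0" title="Link!">$0</a>', $string);
echo $result;
```

The angle brackets and the trailing period stay outside the generated link, same as with the hand-escaped version.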

You're right, my bad. :)

Lol! I was scared for a bit, and triple-checked :p :rolleyes:

I hope I'm not opening Pandora's box here, but I figured I might share my E-Mail validation regex. It's pretty lengthy (maybe even unnecessary??), but I have yet to spot a hole in it, although I only subjected it to the different E-Mail formats I knew. Lol, here she is (modified to ignore whitespace for readability):

/^
# user name
(?:[a-z0-9_-]+?\.)*?
[a-z0-9_-]+?
# separates user from domain
@
# sub.domain(s); if present
(?:[a-z0-9_-]+?\.)*?
# domain portion before TLD
[a-z0-9_-]+?
# dot before TLD
\.
# TLD match
[a-z0-9]{2,5}
$/ix

Note: (?: ... ) is a non-capturing group.

Here's a possible application for this regex:

<?php
$email = '[email protected]';
echo 'This E-Mail is <strong>' . (checkEmail($email) ? 'valid' : 'not valid') . '</strong>';

// Create function
function checkEmail($email) {
    $re = "/^(?:[a-z0-9_-]+?\.)*?[a-z0-9_-]+?@(?:[a-z0-9_-]+?\.)*?[a-z0-9_-]+?\.[a-z0-9]{2,5}$/i";
    return preg_match($re, $email) === 1; // preg_match() returns 1 on a match, 0 on no match, false on error
}
?>
You could expand this so that it captures each portion of an E-Mail, for whatever reason. See php.net (http://php.net)'s PCRE manual page (http://php.net/pcre) for more info on different functions to use.
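As a rough idea of what capturing each portion could look like, here is a sketch with named groups. It follows the same user / domain / TLD layout as the pattern above, but the helper name `splitEmail`, the group names and the sample address are all made up, and I dropped the lazy quantifiers since the pattern is anchored at both ends anyway.

```php
<?php
// Hypothetical helper: returns the pieces of an address, or null if the pattern fails.
function splitEmail(string $email): ?array
{
    $re = '/^
        (?<user>(?:[a-z0-9_-]+\.)*[a-z0-9_-]+)    # user name
        @
        (?<domain>(?:[a-z0-9_-]+\.)*[a-z0-9_-]+)  # sub-domains and domain
        \.
        (?<tld>[a-z0-9]{2,5})                     # TLD
    $/ix';

    if (preg_match($re, $email, $m) !== 1) {
        return null;
    }
    // $m also holds numeric keys, so pick out the named ones explicitly.
    return ['user' => $m['user'], 'domain' => $m['domain'], 'tld' => $m['tld']];
}

print_r(splitEmail('john.doe@mail.example.com'));
```

Everything between the `@` and the last dot ends up in `domain`, because the TLD group cannot contain a dot.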

This is something I made because I was a little bored. It should match any valid URL. That implies that it may match some invalid ones, but I've tried to reduce the number of false matches to a minimum.


<?php
$regex = '~(?>[a-z+]{2,}://|www\.)(?:[a-z0-9]+(?:\.[a-z0-9]+)[email protected])?(?:(?:[a-z](?:[a-z0-9]|(?<!-)-)*[a-z0-9])(?:\.[a-z](?:[a-z0-9]|(?<!-)-)*[a-z0-9])+|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))(?:/[^\\/:?*"<>|\n]*[a-z0-9])*/?(?:\?[a-z0-9_.%]+(?:=[a-z0-9_.%:/+-]*)?(?:&[a-z0-9_.%]+(?:=[a-z0-9_.%:/+-]*)?)*)?(?:#[a-z0-9_%.]+)?~i';
?>

Sample URLs it matches:


http://www.php.net/manual/en/language.types.string.php#language.types.string.conversion
http://www.google.com/search?client=opera&rls=en&q=sample&sourceid=&ie=utf-8&oe=utf-8
http://66.249.93.104/search?q=cache:bCQPHS_h08gJ:stardust.jpl.nasa.gov/+sample&hl=en&ct=clnk&cd=4&client=
http://zdrowie.onet.pl/1340860,2039,0,1,,ortoreksja_czyli_obsesja,profilaktyka.html
https://www.rlemon.com
http://rlemon.org
ftp://[email protected]
ftps://[email protected]
www.codingforums.com (lazy "www." match instead of protocol)
svn+ssh://something.net/repository

The regex won't match invalid domains or IP addresses (or shouldn't match them).
I bet there is some error there :p

LOL!! You're still alive after that!! I don't think I could read through it all without separating bits of it (using /x modifier). That's very awesome! :) I can already see you thought about URLs that never once entered my mind. :rolleyes:

Not intending to put down your work at all, but email validity regex is notoriously difficult because most people write them the same way you did, i.e. "I only subjected it to the different E-Mail formats I knew". The Internet Message Format standard (RFC 2822 (http://rfc.net/rfc2822.html)) outlines what truly is a valid email address.
For an email regex that conforms to RFC 2822 and relevant discussion on the matter, check out this blog post (http://www.ilovejackdaniels.com/php/email-address-validation/) :)

Thanks, I already was aware of the insignificance of my regex, and for the very same reasons you outlined. I guess I just wanted to share what little I could. Most of my regexes are too specific to a task to be useful.
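If a hand-rolled pattern isn't a requirement, PHP's filter extension (5.2 and later) has a built-in check that might be worth a look. As far as I know it is not a complete RFC 2822 implementation either, so treat this as a sketch; the wrapper name `isValidEmail` and the sample addresses are made up.

```php
<?php
// Hypothetical wrapper around the built-in validator.
function isValidEmail(string $email): bool
{
    // filter_var() returns the filtered string on success and false on failure.
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(isValidEmail('user+tag@example.com'));   // bool(true)
var_dump(isValidEmail('no-at-sign.example.com')); // bool(false)
```

Note the plus sign in the first address: a character class like `[a-z0-9_-]` would reject it.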

Thanks for pointing that out though, I would never have thought of referencing the RFC :p

Marek_mar, I can only find an error when I use your URL extractor in a preg_replace function. I use that to replace URLs in the text with HTML <a..> links.

In preg_replace it works fine as long as the link is terminated with a \r. Example: the following preg_replace, using your pattern, works fine:

$subject = "Follow-ups at http://www.mysite.com/alles.php.
To see samples http://www.asdamcp.dom/galleries/220/.
and http://www.mycom.nl.";

$search = '~(?>[a-z+]{2,}://|www\.)(?:[a-z0-9]+(?:\.[a-z0-9]+)[email protected])?(?:(?:[a-z](?:[a-z0-9]|(?<!-)-)*[a-z0-9])(?:\.[a-z](?:[a-z0-9]|(?<!-)-)*[a-z0-9])+|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))(?:/[^\\/:?*"<>|\n]*[a-z0-9])*/?(?:\?[a-z0-9_.%]+(?:=[a-z0-9_.%:/+-]*)?(?:&[a-z0-9_.%]+(?:=[a-z0-9_.%:/+-]*)?)*)?(?:#[a-z0-9_%.]+)?~i';
$subject = preg_replace($search, "<a href=\"\\0\">\\0</a>", $subject);
echo $subject;
But when the string is 'one-piece' like this:

$subject = "Follow-ups at http://www.mysite.com/alles.php. To see samples http://www.asdamcp.dom/galleries/220/. and http://www.mycom.nl.";
it screws up.
Can you suggest a solution to this?

A (late) fix for the URL matching regex.


$regex = '~(?>[a-z+]{2,}://|www\.)(?:[a-z0-9]+(?:\.[a-z0-9]+)[email protected])?(?:(?:[a-z](?:[a-z0-9]|(?<!-)-)*[a-z0-9])(?:\.[a-z](?:[a-z0-9]|(?<!-)-)*[a-z0-9])+|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))(?:/[^\\/:?*"<>|\n ]*[a-z0-9])*/?(?:\?[a-z0-9_.%]+(?:=[a-z0-9_.%:/+-]*)?(?:&[a-z0-9_.%]+(?:=[a-z0-9_.%:/+-]*)?)*)?(?:#[a-z0-9_%.]+)?~i';

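One character was added: a space in the negated character class of the path part, so a path can no longer run across a space into the following text. :)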
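To see the effect in isolation, here is a sketch with a cut-down pattern. The host part is a simplified stand-in and only the path group is modelled on the full regex, so this is an assumption-laden illustration, not the real thing. The two patterns differ only by the space inside the path's negated class.

```php
<?php
// Simplified stand-ins for the full pattern; only the path group follows the original.
$without = '~https?://[a-z0-9.-]+(?:/[^/:?*"<>|\n]*[a-z0-9])*/?~i';
$with    = '~https?://[a-z0-9.-]+(?:/[^/:?*"<>|\n ]*[a-z0-9])*/?~i';

// One-line version of the sample text from the question above (shortened).
$subject = 'Follow-ups at http://www.mysite.com/alles.php. To see samples http://www.asdamcp.dom/galleries/220/.';

preg_match_all($without, $subject, $a);
preg_match_all($with, $subject, $b);

print_r($a[0]); // one match that runs on until the next "http"
print_r($b[0]); // two separate URLs
```

Without the space in the class, the path segment is free to swallow ". To see samples http" and only stops at the next colon. That would explain why the multi-line string worked: the class already excluded newlines.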

Thanks very much.

Ronald :cool:

moderators may not like in me. We'll seeee.



Moderators don't like swearing in public forums, which is a shame, since your post was otherwise useful.









